Dataset description

Home Equity Line of Credit (HELOC) Dataset

Our dataset consists of 10,460 anonymized records of real homeowners' credit applications. There are 24 variables of credit bureau data, most of which describe specific aspects of a client's past financial behaviour, such as:

  • Number of Installment Trades with Balance
  • Months Since Most Recent Delinquency

Main difficulties

Different NA meaning

There are three distinct NA codes in the dataset, each reflecting a different reason for the missing value:

  • No Bureau Record or No Investigation (NA_9)
  • No Usable/Valid Trades or Inquiries (NA_8)
  • Condition not Met (e.g. No Inquiries, No Delinquencies) (NA_7)

Specific categorical values

There are two columns of already preprocessed data: MaxDelq2PublicRecLast12M and MaxDelqEver. These are numerical values that were transformed into categorical ones. Their meaning is somewhat confusing: some of the levels are ordered and some are not.

Problem introduction

The aim of the project is to construct a model predicting the binary RiskPerformance variable, with two levels (Good and Bad), based on the remaining variables.

Introductory cleaning

This report was written with the goal of analysing the HELOC dataset and preparing it for machine learning algorithms. The most important point to clarify beforehand is that the original dataset was randomly split into three separate subsets (training, tuning and testing) in 3:1:1 proportions. As the original data had roughly 10,500 observations, one can easily compute the approximate number of rows in each subset.
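The 3:1:1 split described above could be produced along these lines (a minimal sketch; the input file name and path are assumptions, not the actual ones used):

```r
library(readr)

set.seed(42)  # for a reproducible split
full <- read_csv("dataset/heloc.csv")  # hypothetical path to the full dataset

# Assign each row to one of the three subsets with 3:1:1 weights
grp <- sample(c("train", "tune", "test"), nrow(full),
              replace = TRUE, prob = c(3, 1, 1))

write_csv(full[grp == "train", ], "dataset/df_train.csv")
write_csv(full[grp == "tune",  ], "dataset/df_tune.csv")
write_csv(full[grp == "test",  ], "dataset/df_test.csv")
```

Note that `prob = c(3, 1, 1)` gives the 3:1:1 proportions only in expectation; an exact split would shuffle a fixed vector of labels instead.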

We begin by reading the .csv file with the training set. This is the only subset we will analyse in this report, as we should have no prior knowledge about the other two before using them for tuning and testing, respectively.

library(readr)  # for read_csv() and the col_*() specification helpers

df <- read_csv("dataset/df_train.csv", col_types = cols(
  RiskPerformance = col_factor(levels = c("Good", "Bad")),
  ExternalRiskEstimate = col_double(),
  MSinceOldestTradeOpen = col_double(),
  MSinceMostRecentTradeOpen = col_double(),
  AverageMInFile = col_double(),
  NumSatisfactoryTrades = col_double(),
  NumTrades60Ever2DerogPubRec = col_double(),
  NumTrades90Ever2DerogPubRec = col_double(),
  PercentTradesNeverDelq = col_double(),
  MSinceMostRecentDelq = col_double(),
  MaxDelq2PublicRecLast12M = col_double(),
  MaxDelqEver = col_double(),
  NumTotalTrades = col_double(),
  NumTradesOpeninLast12M = col_double(),
  PercentInstallTrades = col_double(),
  MSinceMostRecentInqexcl7days = col_double(),
  NumInqLast6M = col_double(),
  NumInqLast6Mexcl7days = col_double(),
  NetFractionRevolvingBurden = col_double(),
  NetFractionInstallBurden = col_double(),
  NumRevolvingTradesWBalance = col_double(),
  NumInstallTradesWBalance = col_double(),
  NumBank2NatlTradesWHighUtilization = col_double(),
  PercentTradesWBalance = col_double()
))

General look at variables

We had some prior knowledge of the data, so we knew that – apart from the target column RiskPerformance – every variable is numeric. As such, we can have a look at a pairs plot, where every combination of two variables is displayed.

Pairs plot of analysed dataset


At first glance, one can easily notice that this plot is not very readable. However, the folder containing this report includes pairs_plot.png in the original resolution, which can be enlarged and actually read.

Missing data

We didn’t use the obvious choice of checking missing data statistics with the built-in functions of the DataExplorer library, because we knew that NAs are encoded differently in this dataset. To clarify: all non-missing data here is expressed with non-negative numbers, while there are three kinds of NAs, each with its own negative code (-7, -8 or -9) and its own meaning. They also appear at different frequencies throughout the variables, though each code sums to a similar total.

It may be valuable to take a look at the distributions of NAs across variables, so here is a plot illustrating the issue.

NA-s split by type and variable


Note that MSinceMostRecentDelq column shows the highest fraction of complicated NA structure.
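The counts behind the plot can be reproduced with a summary along these lines (a sketch of the computation, not the plotting code itself):

```r
# Fraction of each NA code in every column except the target
na_codes <- c(-7, -8, -9)
na_summary <- sapply(na_codes, function(code) colMeans(df[, -1] == code))
colnames(na_summary) <- paste0("NA_", abs(na_codes))

# Show only the variables that contain at least one NA code
round(na_summary[rowSums(na_summary) > 0, ], 3)
```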

Now that we have seen a graphical representation, let’s move on to the description, which will tell us a bit more about the meaning of each NA code. As we said before, -9 means no credit history and/or score is available and can be considered the “true” NA, as it is a typical example of missing data. -8 is used when there is no usable data for a record, e.g. if the person was inactive for a year. Finally, -7 is used when a condition is not met, e.g. when someone has never had an inquiry before the last 7 days (second row, last column in the plot above) and we ask for the months since such an event.

There are rows that contain only -9s (not counting the target column), so they are not informative. Let’s drop these rows.

df <- df[!apply(df[, -1], 1, function(x) all(x == -9)), ]

Now we need to check in which columns there are -9s left.

names(which(apply(df[, -1], 2, function(x) any(x == -9))))
## [1] "ExternalRiskEstimate"

Luckily, only one column contains -9s after our first cleaning step. These are records for which nobody ever made a risk estimate. They are not numerous (only 5 cases, less than 0.1% of the data), so we can assume that these estimates would fall somewhere around the median. The code below substitutes the median value of this column for the aforementioned cases.

df[df$ExternalRiskEstimate == -9, "ExternalRiskEstimate"] <-
  median(df$ExternalRiskEstimate[df$ExternalRiskEstimate != -9])  # exclude the -9 codes from the median itself

NA-s with -9s cleaned

Now the plot looks similar to the previous one, but -9s are skipped, as there are none left in the dataset. The non-zero percentages grew a bit, because we deleted some rows, but the overall proportions remained the same as before.

Overlapping columns

Remember that gigantic plot of every possible column pair? There was more to it than just points on graphs. It also contained the calculated correlation for each pair, which we can now use to decide which columns to drop. But first, let’s plot it again after the cleaning, as the results might have changed a bit.

Updated pairs plot


knitr::include_graphics("good_plots/zoom.gif")


If we look closely and thoroughly, we can find two pairs of columns with high correlation values; their names also hint at the overlap. After several rounds of rock-paper-scissors, a decision was made to:

  • exclude NumTrades60Ever2DerogPubRec and leave NumTrades90Ever2DerogPubRec (0.897 correlation),
  • exclude NumInqLast6M and leave NumInqLast6Mexcl7days (0.992 correlation).
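Rather than reading the correlations off the plot by eye, the highly correlated pairs can also be surfaced programmatically. A sketch, with the 0.9 threshold being our choice; at this point the data still contains the -7/-8 codes, so treat the values as a rough screen only:

```r
# Pairwise correlations between all numeric columns (target excluded)
cors <- cor(df[, -1])

# Indices of the upper-triangle entries above the threshold
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)

data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           cor  = cors[high])
```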

df <- drop_columns(df, c("NumTrades60Ever2DerogPubRec", "NumInqLast6M"))

Back to other NAs

We got rid of the -9s in our dataset, but there are still two other NA codes. Before treating them, we need to remind ourselves which columns contain them.

(na_columns <- names(which(apply(df[, -1], 2, function(x) any(x %in% c(-7, -8))))))
## [1] "MSinceOldestTradeOpen"             
## [2] "MSinceMostRecentDelq"              
## [3] "MSinceMostRecentInqexcl7days"      
## [4] "NetFractionRevolvingBurden"        
## [5] "NetFractionInstallBurden"          
## [6] "NumRevolvingTradesWBalance"        
## [7] "NumInstallTradesWBalance"          
## [8] "NumBank2NatlTradesWHighUtilization"
## [9] "PercentTradesWBalance"

What seemed important to us was the impact of these 9 incomplete variables on the target variable, so we used boxplots to get some idea.

Impact of incomplete variables on target


Histogram of columns with NA

Based on the plots above and the correlation plot between all variables (shown a moment earlier), we decided to drop the NumInstallTradesWBalance column, which had about 10% of -8s and almost identical boxes regardless of the RiskPerformance category.

df <- drop_columns(df, "NumInstallTradesWBalance")
na_columns <- na_columns[na_columns != "NumInstallTradesWBalance"]

Conversion of a column with multiple NA types into two columns

Encoding of missing data used in our dataset is, honestly speaking, quite inconvenient. We decided to remedy that by splitting necessary columns into two separate ones:

  • old numerical column with one general NA
  • new logical column dependent on type of NA

The second column is created with the following question in mind: “Does the considered entity exist?”. As such, we encoded it with FALSE for -7s, as this code indicates that the requested entity does not exist, and TRUE for -8s and non-missing data. We would also set NA in this column for -9s, but there were none left at this point. Quite obviously, we dropped this column whenever it contained only TRUE values, as it would tell us nothing.

This type of conversion was applied on two columns with mixed NAs – MSinceMostRecentInqexcl7days and MSinceMostRecentDelq – and six more with -8s only.
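The conversion is carried out by our na_process() helper. A minimal version consistent with the rules above could look like this (a sketch; the actual helper used in the report may differ in details such as column placement):

```r
na_process <- function(df, col, flag_col = NULL) {
  x <- df[[col]]
  if (!is.null(flag_col)) {
    flag <- x != -7                    # FALSE for -7 ("condition not met"), TRUE otherwise
    if (!all(flag)) df[[flag_col]] <- flag  # keep the flag only if it is not constant
  }
  df[[col]][x %in% c(-7, -8)] <- NA    # collapse the remaining codes into one general NA
  df
}
```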

MSinceMostRecentDelq   MSinceMostRecentInqexcl7days
                  22                             -7
                  -7                              0
                   0                              2
                  32                              0
                  -7                              0
                   3                              4

df <- na_process(df, "MSinceMostRecentDelq", "HadDelq")
df <- na_process(df, "MSinceMostRecentInqexcl7days", "HadInqexcl7days")
df <- na_process(df, "MSinceOldestTradeOpen")
df <- na_process(df, "NetFractionRevolvingBurden")
df <- na_process(df, "NetFractionInstallBurden")
df <- na_process(df, "NumRevolvingTradesWBalance")
df <- na_process(df, "NumBank2NatlTradesWHighUtilization")
df <- na_process(df, "PercentTradesWBalance")

MSinceMostRecentDelq  HadDelq  MSinceMostRecentInqexcl7days  HadInqexcl7days
                  22     TRUE                            NA            FALSE
                  NA    FALSE                             0             TRUE
                   0     TRUE                             2             TRUE
                  32     TRUE                             0             TRUE
                  NA    FALSE                             0             TRUE
                   3     TRUE                             4             TRUE

Coefficient of determination

vtreat R² for all variables
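The per-variable R² values come from vtreat’s scoreFrame. A sketch of the call, assuming the Bad level is taken as the positive outcome (the target level is our choice here, not dictated by the data):

```r
library(vtreat)

treatments <- designTreatmentsC(
  dframe        = df,
  varlist       = setdiff(names(df), "RiskPerformance"),
  outcomename   = "RiskPerformance",
  outcometarget = "Bad",
  verbose       = FALSE
)

# One row per derived variable, with its pseudo R^2 and significance
sf <- treatments$scoreFrame
sf[order(-sf$rsq), c("varName", "rsq", "sig")]
```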

Appendix

Data Profiling Report

Basic Statistics

Name                  Value
Rows                  5,937
Columns               24
Discrete columns      1
Continuous columns    23
All missing columns   0
Missing observations  0
Total observations    142,488
Memory allocation     1.1 Mb

Data Structure

Missing Data Profile

Univariate Distribution

Histogram

Bar Chart (by frequency)

Correlation Analysis

Principal Component Analysis